Abstract: Sparse coding in learned dictionaries has been 
established as a successful approach for signal denoising, 
source separation and solving inverse problems in general. A 
dictionary learning method adapts an initial dictionary to a 
particular signal class by iteratively computing an approximate 
factorization of a training data matrix into a dictionary and 
a sparse coding matrix. The learned dictionary is charac- 
terized by two properties: the coherence of the dictionary 
to observations of the signal class, and the self-coherence of 
the dictionary atoms. A high coherence to the signal class 
enables the sparse coding of signal observations with a small 
approximation error, while a low self-coherence of the atoms 
guarantees atom recovery and a more rapid residual error 
decay rate for the sparse coding algorithm. The two goals 
of high signal coherence and low self-coherence are typically 
in conflict, so one seeks a trade-off between them, 
depending on the application. We present a dictionary learning 
method with an effective control over the self-coherence of the 
trained dictionary, enabling a trade-off between maximizing 
the sparsity of codings and approximating an equiangular tight 
frame. 

Index Terms — Dictionary learning, sparse coding, coherence. 



I. Introduction 

Dictionary learning adapts an initial dictionary to a par- 
ticular signal class with the help of training observations, 
such that further observations from that class can be sparsely 
coded in the trained dictionary with low approximation error. 
Over-complete dictionaries, consisting of more atoms than 
dimensions of the feature space, typically support sparser 
codings by placing more atoms in densely populated regions 
of the feature space. However, this redundancy increases the 
self-coherence of the dictionary, i.e. the pairwise similarity 
of dictionary atoms, as measured by the cosine of the 
angle between atom pairs. A lower self-coherence permits 
better support recovery [2] and a more rapid decay of the 
residual norm when increasing the coding cardinality [?]. 
Furthermore, bounding the admissible self-coherence during 
training can increase the generalization performance of the 
dictionary, by avoiding over-fitting to the training data and 
by avoiding atom degeneracy, i.e. two atoms collapsing onto 
the same vector. 

We present a dictionary learning algorithm called IDL($\gamma$), 
which enables an effective control over the self-coherence 
of trained dictionaries. Our method is able to span the 
full spectrum of optimization objectives, from maximizing 
the sparsity of the resulting codings, to approximating an 
equiangular tight frame (ETF), which is a dictionary achiev- 
ing minimal self-coherence for a given number of atoms. 
We demonstrate the benefits of limiting the self-coherence 
of the dictionary in terms of better coding support recovery 
and improved generalization performance (see Sec. III). 

A. From Bases to Over-Complete Dictionaries 

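An orthonormal basis $\mathbf{B} \in \mathbb{R}^{D \times D}$ contains $D$ mutually 
orthogonal unit $\ell_2$ norm atoms spanning the feature space 
$\mathbb{R}^D$. The unique code $\mathbf{c} \in \mathbb{R}^D$ of an observation $\mathbf{x} \in \mathbb{R}^D$ is 
computed by $\mathbf{c} = \mathbf{B}^T \mathbf{x}$ (signal analysis), and the signal is 
recovered from the code by $\mathbf{x} = \mathbf{B}\mathbf{c}$ (signal synthesis). The 
Gram matrix $\mathbf{G} = \mathbf{B}^T \mathbf{B} = \mathbf{I}$ of $\mathbf{B}$ is the identity matrix. 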
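
As a small numerical illustration of analysis ($\mathbf{c} = \mathbf{B}^T \mathbf{x}$) and synthesis ($\mathbf{x} = \mathbf{B}\mathbf{c}$) in an orthonormal basis, the following sketch builds a random orthonormal basis with a QR decomposition. The dimension, the random seed and the use of QR are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4  # feature space dimension D (illustrative value)

# Any orthonormal basis works here; QR of a random matrix is one way to get it.
B, _ = np.linalg.qr(rng.standard_normal((dim, dim)))

x = rng.standard_normal(dim)  # observation
c = B.T @ x                   # signal analysis: unique code of x in B
x_rec = B @ c                 # signal synthesis: recover x from its code
G = B.T @ B                   # Gram matrix, equal to the identity for a basis

print(np.allclose(G, np.eye(dim)), np.allclose(x_rec, x))
```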

Although natural signals are approximately sparse in suit- 
ably chosen bases, typically a sparser code can be achieved 
using an over-complete dictionary $\mathbf{D} \in \mathbb{R}^{D \times L}$, with $L > D$ 
unit $\ell_2$ norm atoms, by placing more atoms in densely 
populated regions of the feature space. However, due to the 
redundant number of atoms, coding x in D no longer has a 
unique solution. Therefore, signal analysis in over-complete 
dictionaries needs to be performed using a sparse coding 
algorithm, such as orthogonal matching pursuit (OMP) [8]. 
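
To make the role of a sparse coding algorithm concrete, here is a minimal sketch of orthogonal matching pursuit with a cardinality stopping criterion. The function name `omp` and the small test dictionary (identity atoms plus a normalized 4x4 Hadamard matrix) are assumptions for illustration, not part of the method described in this paper.

```python
import numpy as np

def omp(D, x, k):
    """Greedy sparse coding of x in dictionary D using at most k atoms."""
    support = []
    coef = np.zeros(0)
    residual = x.astype(float)
    for _ in range(k):
        # Select the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break  # no new atom found, stop early
        support.append(j)
        # Re-fit all selected coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    c = np.zeros(D.shape[1])
    if support:
        c[support] = coef
    return c

# Over-complete dictionary with D = 4 dimensions and L = 8 unit norm atoms.
H = 0.5 * np.array([[1, 1, 1, 1],
                    [1, -1, 1, -1],
                    [1, 1, -1, -1],
                    [1, -1, -1, 1]], dtype=float)
D_demo = np.hstack([np.eye(4), H])

x_demo = 2.0 * D_demo[:, 5]        # observation with an exact 1-sparse code
c_demo = omp(D_demo, x_demo, k=1)  # recovers the single active atom
```

In this example the pairwise inner products have magnitude at most 0.5, so a single active atom is identified in the first greedy step.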

The non-orthogonality of atoms is measured by the self- 
coherence of the dictionary, which can be defined as the 
maximum magnitude over all off-diagonal elements of the 
Gram matrix $\mathbf{G} = \mathbf{D}^T \mathbf{D}$, 

$$\mu(\mathbf{D}) = \max_{d \neq e} \left| \mathbf{G}_{(d,e)} \right|. \qquad (1)$$

It therefore holds that $\mu(\mathbf{D}) \in [0, 1]$. Note that this definition 
of the dictionary self-coherence can be misleading when 
most inner products have small magnitudes [12]. Therefore, 
the full inner product distribution is considered in Sec. III. 

$\mathbf{D}$ has minimum self-coherence for a given dimension 
$D$ and dictionary size $L$ if the magnitudes of all the 
off-diagonal elements of $\mathbf{G}$ are equal (see Thm. 1 below). In 
this case, the dictionary is called an equiangular tight frame 
(ETF) [?]. Formally, $\mathbf{E} \in \mathbb{R}^{D \times L}$ is an ETF if there exists 
an $\alpha$, $0 < \alpha < \pi/2$, such that 

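$$\left| \mathbf{e}_{(:,d)}^T \mathbf{e}_{(:,e)} \right| = \cos(\alpha), \quad \forall d \neq e, \qquad (2)$$

and if 

$$\mathbf{E}\mathbf{E}^T = \frac{L}{D}\,\mathbf{I}. \qquad (3)$$

Therefore, $\mathbf{E}$ has $D$ non-zero singular values all equal to 
$\sqrt{L/D}$. The following theorem establishes a lower bound 
on the minimum of the self-coherence. 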

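Theorem 1. [10, Theorem 2.3] The self-coherence of a 
dictionary $\mathbf{D} \in \mathbb{R}^{D \times L}$ with unit $\ell_2$ norm atoms is bounded 
from below by 

$$\mu(\mathbf{D}) \geq \sqrt{\frac{L - D}{D\,(L - 1)}}. \qquad (4)$$

Equality holds if and only if $\mathbf{D}$ is an ETF and $L \leq D(D + 1)/2$. 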
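
The quantities above can be checked numerically. The sketch below computes the self-coherence as the largest off-diagonal magnitude of the Gram matrix and compares it with the lower bound $\sqrt{(L - D)/(D(L - 1))}$ (the Welch bound) for three unit vectors spaced at 120 degrees in the plane, a standard small ETF. The helper names `self_coherence` and `welch_bound` are hypothetical.

```python
import numpy as np

def self_coherence(D):
    """Maximum absolute off-diagonal entry of the Gram matrix D^T D."""
    G = D.T @ D
    off_diag = G - np.diag(np.diag(G))
    return float(np.max(np.abs(off_diag)))

def welch_bound(dim, num_atoms):
    """Lower bound sqrt((L - D) / (D (L - 1))) on the self-coherence."""
    return float(np.sqrt((num_atoms - dim) / (dim * (num_atoms - 1))))

# Three unit norm atoms in D = 2 dimensions, spaced at 120 degrees.
angles = np.pi / 2 + 2 * np.pi * np.arange(3) / 3
E = np.vstack([np.cos(angles), np.sin(angles)])

mu = self_coherence(E)                          # equals |cos(120 deg)| = 0.5
bound = welch_bound(2, 3)                       # equals sqrt(1/4) = 0.5
sing_vals = np.linalg.svd(E, compute_uv=False)  # both equal sqrt(L/D)
```

For this frame the bound is attained and both singular values equal $\sqrt{3/2}$, matching the flat spectrum expected of an ETF.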

The self-coherence of a dictionary influences the recovery 
of the sparse coding support of a signal observation, i.e. the 
set of atoms that are associated with the non-zero coding 
coefficients. The exact recovery condition (ERC) [3] states 
that, assuming that the observation in fact has an exact sparse 
coding $\mathbf{c}$ in $\mathbf{D}$, the support of $\mathbf{c}$ is recovered if 

$$\|\mathbf{c}\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D})}\right). \qquad (5)$$

Furthermore, $\mu(\mathbf{D})$ also upper bounds the residual error 
norm decay curve in iterative sparse coding algorithms such 
as OMP [?]. 

B. Related Work 

Yaghoobi et al. proposed a design algorithm for parametric 
dictionaries [13]. A parametric dictionary $\mathbf{D}_\Gamma$ consists 
of atoms which have a specific functional form controlled 
by a small number of parameters. The proposed algorithm 
accepts a given $\mathbf{D}_\Gamma$ as its input, and optimizes it such that 
its Gram matrix approximates the optimal properties of an 
ETF. However, this approach relies on expert knowledge 
for choosing the appropriate parametric family for a given 
application, and provides no mechanism to adapt $\mathbf{D}_\Gamma$ if the 
signal characteristics are not known in advance. Therefore, 
an analytic dictionary design approach is for instance not 
suited to source separation of partially coherent sources [9]. 

The K-SVD algorithm [1] adapts a non-parametric dictionary 
to training data. In each iteration of the algorithm, 
atoms whose coherence to another atom is too high are 
replaced. If the coherence to another atom lies above a 
threshold $\mu_t$, the atom is replaced by a training 
observation which does not have a sparse representation 
in the current dictionary. Therefore, the likelihood 
that the replacement atom is less coherent to the dictionary 
is high. However, if multiple atoms are replaced (which is 
almost always the case in practice), this strategy does not 
guarantee that the dictionary self-coherence falls below $\mu_t$. 
In our experiments, an effective control over the dictionary 
self-coherence using the proposed atom thresholding step 
was not possible (see Sec. III). 

Very recently, and independently from our own work, 
Mailhé et al. [7] proposed a more sophisticated atom 
decorrelation step for the K-SVD algorithm called INK-SVD, 
where pairs of atoms are decorrelated until the dictionary 
satisfies the maximum inner product bound (1). 
After the dictionary update step of the K-SVD algorithm is 
complete, each pair of atoms which has a coherence above 
the threshold $\mu_t$ has its inner angle increased symmetrically 
until the threshold is satisfied. Because this procedure can 
inadvertently increase the coherence to other atoms, the 
pairwise decorrelation step has to be iterated until the 
self-coherence threshold is satisfied for the complete dictionary. 
Unfortunately, due to this fact the number of necessary 
decorrelation steps can grow very large if a small $\mu_t$ is 
enforced (see Sec. III). 
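
The idea of symmetrically increasing the angle between one pair of atoms can be sketched as a rotation within the plane spanned by the two atoms. This is only an illustration of the geometric step under our own assumptions (unit norm, non-collinear atoms, hypothetical function name `decorrelate_pair`); it is not the published INK-SVD implementation.

```python
import numpy as np

def decorrelate_pair(d1, d2, mu_t):
    """Rotate two unit atoms symmetrically in their common plane so that
    the magnitude of their inner product equals mu_t. Pairs that already
    satisfy the threshold are returned unchanged."""
    ip = float(d1 @ d2)
    if abs(ip) <= mu_t:
        return d1, d2
    sign = 1.0 if ip > 0 else -1.0
    d2s = sign * d2              # flip d2 so the inner product is positive
    u = d1 + d2s                 # bisector of the two atoms
    v = d1 - d2s                 # orthogonal to u because both have unit norm
    v_norm = np.linalg.norm(v)
    if v_norm < 1e-12:
        raise ValueError("collinear atoms span no plane to rotate in")
    u = u / np.linalg.norm(u)
    v = v / v_norm
    theta = 0.5 * np.arccos(mu_t)  # half of the target angle between atoms
    new1 = np.cos(theta) * u + np.sin(theta) * v
    new2 = sign * (np.cos(theta) * u - np.sin(theta) * v)
    return new1, new2

a = np.array([1.0, 0.0, 0.0])
b = np.array([0.9, np.sqrt(1.0 - 0.81), 0.0])  # inner product 0.9 with a
a_new, b_new = decorrelate_pair(a, b, mu_t=0.5)
```

Because the rotation ignores all other atoms, the coherence of `a_new` or `b_new` to the rest of the dictionary may increase, which is why such a step has to be iterated over the whole dictionary.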

C. Our Contribution 

We present a dictionary learning algorithm where a bound 
on the dictionary self-coherence is enforced directly in the 
atom update step. Instead of bounding the maximum inner 
product (1) as in the INK-SVD algorithm, our algorithm 
enforces an upper bound on the sum of squared inner product 
values. By varying a Lagrange multiplier $\gamma$, it is possible to 
realize any trade-off between maximizing the sparsity of the 
code and minimizing the self-coherence of the dictionary. 

Since IDL($\gamma$) maximizes the coherence of a dictionary 
to a particular signal class, prior expert knowledge to 
choose the right parametric dictionary family and parameter 
discretization is not necessary. Furthermore, the IDL($\gamma$) 
algorithm makes it possible to train an incoherent dictionary 
even if the number of atoms is large compared to the 
dimensionality of the signal space. Finally, 
we empirically demonstrate for a speech coding task that 
training an incoherent dictionary using IDL($\gamma$) improves the 
sparse coding fidelity of the dictionary on unseen test data. 

II. Method 

A dictionary learning algorithm approximately factorizes a 
data matrix $\mathbf{X} \in \mathbb{R}^{D \times N}$ into a dictionary matrix $\mathbf{D} \in \mathbb{R}^{D \times L}$ 
and a coding matrix $\mathbf{C} \in \mathbb{R}^{L \times N}$. The algorithm minimizes 
the approximation error 

$$\operatorname*{arg\,min}_{\mathbf{D},\,\mathbf{C}} \|\mathbf{X} - \mathbf{D}\mathbf{C}\|_F^2, \qquad (6)$$

measured by the squared Frobenius norm, subject to a 
sparsity constraint on $\mathbf{C}$ and a unit $\ell_2$ norm constraint on 
the atoms (columns) of $\mathbf{D}$. Since (6) is not jointly convex 
in $\mathbf{D}$ and $\mathbf{C}$, many proposed algorithms employ alternating 
minimization w.r.t. $\mathbf{C}$ and $\mathbf{D}$ until convergence to a local 
optimum. In the following, we focus our discussion on the 
dictionary update step. 
The K-SVD algorithm minimizes (6) for each atom independently. 
Given the newly updated dictionary, if there exist 
atoms $\mathbf{d}_{(:,d)}$ and $\mathbf{d}_{(:,e)}$ such that 

$$\left| \mathbf{d}_{(:,d)}^T \mathbf{d}_{(:,e)} \right| > \mu_t, \quad d \neq e, \qquad (7)$$

$\mathbf{d}_{(:,e)}$ is replaced by $\mathbf{x}_{(:,n)} / \|\mathbf{x}_{(:,n)}\|_2$, where $n$ is chosen such 
that $\|\mathbf{x}_{(:,n)} - \mathbf{D}\mathbf{c}_{(:,n)}\|_2$ is large. Since observations having a 
large approximation error are likely incoherent to the current 
dictionary, the replacement atoms likely have a coherence 
below $\mu_t$ to all atoms already in the dictionary. However, if 
more than one atom is replaced, the coherence between the 
replacement atoms can potentially be large. This approach 
therefore does not guarantee that the self-coherence of the 
updated dictionary falls below $\mu_t$. 
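
The atom replacement heuristic can be sketched as follows. This is one possible reading of the step described above, under our own assumptions: residuals are computed once with the dictionary before replacement, each atom is compared only to the atoms preceding it, and the function name `replace_coherent_atoms` is hypothetical.

```python
import numpy as np

def replace_coherent_atoms(D, X, C, mu_t):
    """Replace every atom whose coherence to an earlier atom exceeds mu_t
    by a normalized training observation with large approximation error."""
    D = D.astype(float)  # astype returns a copy, the input is not modified
    residual_norms = np.linalg.norm(X - D @ C, axis=0)
    # Training observations ordered from worst to best approximated.
    candidates = [int(n) for n in np.argsort(residual_norms)[::-1]]
    for e in range(1, D.shape[1]):
        coherence = np.abs(D[:, :e].T @ D[:, e])
        if coherence.max() > mu_t and candidates:
            n = candidates.pop(0)
            # Assumes the chosen observation is not the zero vector.
            D[:, e] = X[:, n] / np.linalg.norm(X[:, n])
    return D

D_old = np.array([[1.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])   # atoms 0 and 1 are identical
X_train = np.array([[0.0, 3.0],
                    [2.0, 4.0]])
C_old = np.zeros((3, 2))              # all-zero codes, so residuals equal X
D_new = replace_coherent_atoms(D_old, X_train, C_old, mu_t=0.9)
```

In this toy case the duplicate atom is replaced by the normalized second observation, and the new atom still has an inner product of 0.8 with the third atom. The replacement therefore lowers, but does not explicitly control, the resulting self-coherence.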

Although updating atoms independently of each other is 
computationally efficient, it is not well suited to enforc- 
ing a self-coherence constraint, which introduces additional 
dependencies between all atoms. We propose a dictionary 
update step where the atoms are jointly optimized, and 
the dictionary self-coherence is minimized along with the 
approximation error. 

Thm. 1 motivates our choice to augment the minimization 
of the objective (6) w.r.t. $\mathbf{D}$ with a self-coherence penalty, 

$$\operatorname*{arg\,min}_{\mathbf{D}} \|\mathbf{X} - \mathbf{D}\mathbf{C}\|_F^2 + \gamma \|\mathbf{D}^T \mathbf{D} - \mathbf{I}\|_F^2, \qquad (8)$$

where the Lagrange multiplier $\gamma$ controls the trade-off between 
minimizing the approximation error and minimizing 
the self-coherence. The second term in (8) penalizes both the 
average coherence between atoms and any deviation 
from the unit $\ell_2$ norm of each atom. However, we still enforce 
the strict unit $\ell_2$ norm constraint after the optimization 
by rescaling each atom. 

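The gradient of (8) w.r.t. $\mathbf{D}$ is computed by a trace operator 
expansion, $\|\mathbf{A}\|_F^2 = \operatorname{tr}\{\mathbf{A}^T \mathbf{A}\}$, of the approximation 
error term of (8), 

$$\operatorname{tr}\{\mathbf{C}^T \mathbf{D}^T \mathbf{D} \mathbf{C}\} - 2 \operatorname{tr}\{\mathbf{X}^T \mathbf{D} \mathbf{C}\} + \operatorname{tr}\{\mathbf{X}^T \mathbf{X}\}, \qquad (9)$$

and the self-coherence penalty term of (8), 

$$\operatorname{tr}\{\mathbf{D}^T \mathbf{D} \mathbf{D}^T \mathbf{D}\} - 2 \operatorname{tr}\{\mathbf{D}^T \mathbf{D}\} + \operatorname{tr}\{\mathbf{I}\}. \qquad (10)$$

Taking the partial matrix derivatives of (9) and (10) w.r.t. $\mathbf{D}$ 
results in the gradient 

$$2 \left( \mathbf{D} \mathbf{C} \mathbf{C}^T - \mathbf{X} \mathbf{C}^T \right) + 4 \gamma \left( \mathbf{D} \mathbf{D}^T \mathbf{D} - \mathbf{D} \right), \qquad (11)$$

see e.g. [6] for how to take partial derivatives of the trace 
operator. 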
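
A quick way to gain confidence in the gradient expression $2(\mathbf{D}\mathbf{C}\mathbf{C}^T - \mathbf{X}\mathbf{C}^T) + 4\gamma(\mathbf{D}\mathbf{D}^T\mathbf{D} - \mathbf{D})$ is to compare it against central finite differences of the penalized objective. The sketch below does this on small random matrices. The function name `objective_and_gradient`, the problem sizes and the step size are assumptions for illustration.

```python
import numpy as np

def objective_and_gradient(D, X, C, gamma):
    """Penalized objective ||X - DC||_F^2 + gamma ||D^T D - I||_F^2
    and its gradient with respect to D."""
    R = D @ C - X                     # negative residual, same squared norm
    G = D.T @ D                       # Gram matrix of the atoms
    P = G - np.eye(D.shape[1])
    value = np.sum(R ** 2) + gamma * np.sum(P ** 2)
    grad = 2.0 * (R @ C.T) + 4.0 * gamma * (D @ G - D)
    return value, grad

rng = np.random.default_rng(0)
dim, num_atoms, num_obs, gamma = 3, 5, 7, 0.5
D0 = rng.standard_normal((dim, num_atoms))
X0 = rng.standard_normal((dim, num_obs))
C0 = rng.standard_normal((num_atoms, num_obs))

_, grad = objective_and_gradient(D0, X0, C0, gamma)

# Central finite differences, one dictionary entry at a time.
eps = 1e-6
numeric = np.zeros_like(D0)
for i in range(dim):
    for j in range(num_atoms):
        step = np.zeros_like(D0)
        step[i, j] = eps
        f_plus, _ = objective_and_gradient(D0 + step, X0, C0, gamma)
        f_minus, _ = objective_and_gradient(D0 - step, X0, C0, gamma)
        numeric[i, j] = (f_plus - f_minus) / (2.0 * eps)

max_abs_error = float(np.max(np.abs(numeric - grad)))
```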

It is not necessary to find the global minimizer of (8), as 
long as the objective is sufficiently reduced in each iteration 
of the dictionary learning algorithm. We therefore run only 
a few iterations of the limited-memory BFGS algorithm [?], 
which successively builds an approximation to the Hessian 
(i.e. the matrix of second order partial derivatives) from 
evaluating the objective (8) and the gradient (11), without 
directly computing the Hessian matrix (which is infeasible 
for large dictionaries). 
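
A joint dictionary update of this kind could be sketched with an off-the-shelf limited-memory BFGS routine. The example below uses SciPy's `minimize` with `method="L-BFGS-B"` (the bound-constrained variant, used here without bounds), caps the number of iterations, and rescales the atoms to unit norm afterwards. The function name `dictionary_update`, the iteration cap and the random test data are assumptions. The solver used by the authors is not specified beyond the algorithm family.

```python
import numpy as np
from scipy.optimize import minimize

def dictionary_update(D, X, C, gamma, max_iter=10):
    """Run a few L-BFGS iterations on the penalized objective, then rescale
    each atom to unit l2 norm. Returns the new dictionary and the objective
    value reached before rescaling."""
    shape = D.shape
    eye = np.eye(shape[1])

    def fun(d_flat):
        Dm = d_flat.reshape(shape)
        R = Dm @ C - X
        G = Dm.T @ Dm
        value = np.sum(R ** 2) + gamma * np.sum((G - eye) ** 2)
        grad = 2.0 * (R @ C.T) + 4.0 * gamma * (Dm @ G - Dm)
        return value, grad.ravel()

    result = minimize(fun, D.ravel().astype(float), jac=True,
                      method="L-BFGS-B", options={"maxiter": max_iter})
    D_new = result.x.reshape(shape).copy()
    # Assumes no atom collapsed to the zero vector during optimization.
    D_new /= np.linalg.norm(D_new, axis=0, keepdims=True)
    return D_new, float(result.fun), fun

rng = np.random.default_rng(1)
D_init = rng.standard_normal((4, 6))
D_init /= np.linalg.norm(D_init, axis=0, keepdims=True)
X_data = rng.standard_normal((4, 20))
C_codes = rng.standard_normal((6, 20))

D_upd, value_after, fun_handle = dictionary_update(D_init, X_data, C_codes,
                                                   gamma=1.0)
value_before, _ = fun_handle(D_init.ravel())
```

Capping `max_iter` reflects the observation that a sufficient reduction of the objective per outer iteration is enough, so the inner optimization need not run to convergence.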



III. Experiments 

We compare the proposed dictionary learning algorithm, 
denoted IDL($\gamma$), to the K-SVD algorithm with atom replacement 
and the INK-SVD algorithm. The difference of our 
algorithm lies in the dictionary update: it jointly minimizes 
both the data approximation error and the coherence of all 
pairs of atoms. In contrast, the K-SVD and the INK-SVD 
algorithm first perform a dictionary update step to minimize 
the data approximation error, and then sequentially minimize 
the coherence of pairs of atoms. 

The effectiveness of all algorithms in bounding the 
dictionary self-coherence was evaluated on a speech coding 
task, as follows. The audio recordings of the first male 
speaker of the GRID corpus$^1$ were randomly sub-sampled to 
obtain $N = 30000$ training signals, each $D = 160$ samples 
long. A dictionary with $L = 1000$ atoms was initialized 
using random sampling of training observations. The LARC 
algorithm [9] (an extension of the LARS algorithm [4]) was 
used for the sparse coding step of all dictionary learning 
algorithms, with the LARC residual coherence threshold set 
to $\mu_{dl} = 0.2$ (not to be confused with the self-coherence 
threshold $\mu_t$). The number of dictionary learning iterations 
was set to 25, which resulted in approximate convergence 
to a local optimum in all experiments. 

Figure 1 plots the singular value spectra of the trained 
dictionaries. As a reference, the constant line at $\sqrt{L/D} = 2.5$ 
indicates the flat spectrum of a corresponding ETF. For the 
K-SVD algorithm (left figure), setting $\mu_t = 1$ implies that 
the upper bound on the self-coherence is inactive. Note that 
decreasing $\mu_t$ below unity proved to be counterproductive, 
i.e. the singular value spectrum decreases even more rapidly. 
As desired, lowering $\mu_t$ for the INK-SVD algorithm resulted 
in a flatter spectrum (middle figure), but the computational 
cost is increasingly dominated by the growing number 
of decorrelation steps. Thus we were unable to train a 
dictionary with $\mu_t = 0.1$ (or smaller) in the available 
time frame (24 hours on an Intel Core 2 Duo CPU). The 
results for IDL($\gamma$) (right figure) show that by increasing the 
influence of the self-coherence penalty in (8), it is possible 
to approximate the flat spectrum of an ETF. Setting $\gamma > 50$ 
resulted in even flatter spectra (not shown). Atom coherence 
histograms and atom recovery percentages are available from 
the paper companion webpage$^2$. 

Figure 2 plots the generalization performance of the 
trained dictionaries, in terms of the trade-off between the 
residual norm and the cardinality of the coding. Twenty 
test utterances were coded using OMP with a cardinality 
stopping criterion, and the median residual norm is reported. 
For the K-SVD algorithm, decreasing $\mu_t < 1$ resulted in 
a deteriorating generalization performance. For the INK-SVD 
algorithm, decreasing the residual norm is possible for 
$0.7 > \mu_t > 0.2$ at cardinalities beyond 80, but only at the 
cost of increasing the residual norm at smaller cardinalities. 
While the curves are nearly identical for all algorithms if no 
coherence bound is enforced, the generalization performance 
improves consistently only in the case of IDL($\gamma$). We 
conjecture that the difference is due to joint minimization of the 
residual norm and the dictionary self-coherence in IDL($\gamma$), 
whereas the atom decorrelation of K-SVD and INK-SVD is 
independent of the dictionary update. 

Figure 1. Singular value spectra of the trained dictionaries, as a function of the self-coherence constraint. A flatter spectrum indicates a less coherent 
dictionary. As a reference, the constant line indicates the flat spectrum of the corresponding ETF at $\sqrt{L/D} = 2.5$. 

Figure 2. Generalization performance of the trained dictionaries, as a function of the self-coherence constraint. Smaller values indicate a better trade-off 
between the residual norm and the coding cardinality on test data not seen during training. 

1 http://www.dcs.shef.ac.uk/spandh/gridcorpus/ 
2 http://sigg-iten.ch/research/spl2012/ 

IV. Conclusions and Discussion 

We present a dictionary learning algorithm which enables 
an effective control over the self-coherence of the trained 
dictionary, enabling a trade-off between maximizing the 
sparsity of the code and approximating an equiangular tight 
frame. Neither a simple replacement of overly similar atoms nor 
a pairwise decorrelation of atoms can both effectively and 
efficiently control the dictionary self-coherence. We propose 
a joint atom update step instead, simultaneously minimizing 
the approximation error and the coherence of all pairs of 
atoms. 

We show for a speech coding task that our method is 
able to achieve the full range of optimization objectives, 
from maximizing the coding sparsity to approximating the 
properties of an ETF. Furthermore, we demonstrate the 
benefits of bounding the dictionary self-coherence on the 
generalization performance of the dictionary. 